FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing

Authors

  • Carsten Weinhold
  • Adam Lackorzynski
  • Jan Bierbaum
  • Martin Küttler
  • Maksym Planeta
  • Hermann Härtig
  • Amnon Shiloh
  • Ely Levy
  • Tal Ben-Nun
  • Amnon Barak
  • Thomas Steinke
  • Thorsten Schütt
  • Jan Fajerski
  • Alexander Reinefeld
  • Matthias Lieber
  • Wolfgang E. Nagel
Abstract

SPPEXA: ESSEX / GROMEX Gerhard Wellein / Ivo Kabadshow Highly Adaptive Energy-Efficient Computing SFB912 ASTEROID SPP1500 Gernot Heiser UNSW, NICTA Vijay Saraswat IBM Research Zürich, X10 Torsten Hoefler ETH Zurich Michael Bussmann Helmholtz Zentrum Dresden Rossendorf Eric Van Hensbergen IBM Research Austin DARPA HPCS, FastOS, X-Stack Frank Mueller North Carolina State University

Phase 1 Results: Summary

  • First L4-based prototype
  • Several source-compatible MPI applications ported
  • Tested on small island of real HPC cluster
  • Gossip scalability and resilience modeled, simulated, and measured
  • Erasure-coded in-memory checkpoints with XtreemFS, tested on Cray XC40
  • 2 SPPEXA Workshops

Similar resources

Fault Tolerant DNA Computing Based on Digital Microfluidic Biochips

Historically, DNA molecules have been known as the building blocks of life. Later, in 1994, Leonard Adleman introduced a technique that uses DNA molecules for a new kind of computation. Owing to its massive parallelism, huge storage capacity, and the ability to use DNA molecules inside living tissue, this type of computation is applied in many application areas such as me...

An approach to fault detection and correction in design of systems using of Turbo codes

We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one such code. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...

Multiscale computing in the exascale era

We expect that multiscale simulations will be one of the main high performance computing workloads in the exascale era. We propose multiscale computing patterns as a generic vehicle to realise load balanced, fault tolerant and energy aware high performance multiscale computing. Multiscale computing patterns should lead to a separation of concerns, whereby application developers can compose mult...

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...

Fault-tolerant servers for the RHODOS system

Providing reliable services is one of the primary goals in designing a distributed operating system. Nowadays, we have seen a trend in distributed operating system design to shift from large kernel architectures or even monolithic architectures to microkernel architectures supported by the client/server model. This means that a lot of services of an operating system originally provided by the m...

Publication date: 2016